Scale Space Technique for Word Segmentation in Handwritten Documents
Internal identifier: 001F42 (Main/Exploration); previous: 001F41; next: 001F43
Authors: R. Manmatha [United States]; Nitin Srimal [United States]
Source:
- Lecture Notes in Computer Science [ 0302-9743 ] ; 1999.
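French descriptors
- Pascal (Inist): Analyse documentaire, Caractère manuscrit, Ecriture, Image binaire, Image niveau gris, Mot, Reconnaissance optique caractère, Segmentation, Segmentation ligne, Segmentation mot
English descriptors
- KwdEn: Binary image, Document analysis, Grey level image, Hand writing, Manuscript character, Optical character recognition, Segmentation, Word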
Abstract
Indexing large archives of historical manuscripts, like the papers of George Washington, is required to allow rapid perusal by scholars and researchers who wish to consult the original manuscripts. Presently, such large archives are indexed manually. Since optical character recognition (OCR) works poorly with handwriting, a scheme based on matching word images, called word spotting, has previously been suggested for indexing such documents. The important steps in this scheme are segmentation of a document page into words and creation of lists containing instances of the same word by word image matching. We have developed a novel methodology for segmenting handwritten document images by analyzing the extent of “blobs” in a scale space representation of the image. We believe this is the first application of scale space to this problem. The algorithm has been applied to around 30 grey-level images randomly picked from different sections of the George Washington corpus of 6,400 handwritten document images. An accuracy of 77 to 96 percent was observed, with an average accuracy of around 87 percent. The algorithm works well in the presence of noise, shine-through and other artifacts which may arise due to aging and degradation of the page over a couple of centuries or through the man-made processes of photocopying and scanning.
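The idea of reading word extents off "blobs" at a suitable scale can be illustrated with a small sketch. This is not the authors' algorithm: the abstract gives no operator, scales, or thresholds, so the code below swaps in plain anisotropic Gaussian smoothing followed by thresholding and connected-component labelling. The function names (`gaussian_kernel`, `smooth`, `blob_boxes`), the toy page, and the parameter values are all hypothetical and chosen only for illustration.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel, truncated at 3 sigma (an assumed cutoff)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_rows(image, sigma):
    """Convolve each row with a Gaussian; pixels outside the image count as 0."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = x + k - radius
                if 0 <= xx < width:
                    acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def smooth(image, sigma_x, sigma_y):
    """Separable Gaussian smoothing with independent horizontal and vertical scales."""
    horizontal = smooth_rows(image, sigma_x)
    return transpose(smooth_rows(transpose(horizontal), sigma_y))

def blob_boxes(image, sigma_x, sigma_y, threshold):
    """Bounding boxes (x0, y0, x1, y1) of 4-connected regions whose smoothed
    intensity is at least `threshold`. Ink is assumed to be 1.0, paper 0.0."""
    smoothed = smooth(image, sigma_x, sigma_y)
    height, width = len(smoothed), len(smoothed[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or smoothed[y][x] < threshold:
                continue
            # Flood-fill one blob and track its extent.
            stack = [(x, y)]
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while stack:
                cx, cy = stack.pop()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and smoothed[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return sorted(boxes)

# Toy "text line": two words of three strokes each, separated by a wide gap.
ink_columns = {2, 3, 5, 6, 8, 9, 20, 21, 23, 24, 26, 27}
row = [1.0 if x in ink_columns else 0.0 for x in range(30)]
blank = [0.0] * 30
page = [blank] + [row[:] for _ in range(5)] + [blank]

fine = blob_boxes(page, sigma_x=0.5, sigma_y=0.5, threshold=0.3)
coarse = blob_boxes(page, sigma_x=2.0, sigma_y=0.5, threshold=0.3)
print(len(fine), "blobs at the fine scale;", len(coarse), "at the coarse scale")
```

On this toy input the fine scale keeps every stroke as its own blob, while the coarser horizontal scale merges the strokes of each word into a single blob without bridging the wider gap between words. A real page would need a scale chosen per document and a line-segmentation step, neither of which is specified in the abstract.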
Url: https://api.istex.fr/document/C7DBB4B04AF1CCAA1EC1C3512BA05075940CB2AD/fulltext/pdf
DOI: 10.1007/3-540-48236-9_3
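Affiliations: Computer Science Department, University of Massachusetts, 01003, Amherst, MA, United States (both authors)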
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 001049
- to stream Istex, to step Curation: 000F97
- to stream Istex, to step Checkpoint: 001493
- to stream Main, to step Merge: 002051
- to stream PascalFrancis, to step Corpus: 000795
- to stream PascalFrancis, to step Curation: 000B99
- to stream PascalFrancis, to step Checkpoint: 000761
- to stream Main, to step Merge: 002142
- to stream Main, to step Curation: 001F42
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Scale Space Technique for Word Segmentation in Handwritten Documents</title>
<author><name sortKey="Manmatha, R" sort="Manmatha, R" uniqKey="Manmatha R" first="R." last="Manmatha">R. Manmatha</name>
</author>
<author><name sortKey="Srimal, Nitin" sort="Srimal, Nitin" uniqKey="Srimal N" first="Nitin" last="Srimal">Nitin Srimal</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:C7DBB4B04AF1CCAA1EC1C3512BA05075940CB2AD</idno>
<date when="1999" year="1999">1999</date>
<idno type="doi">10.1007/3-540-48236-9_3</idno>
<idno type="url">https://api.istex.fr/document/C7DBB4B04AF1CCAA1EC1C3512BA05075940CB2AD/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001049</idno>
<idno type="wicri:Area/Istex/Curation">000F97</idno>
<idno type="wicri:Area/Istex/Checkpoint">001493</idno>
<idno type="wicri:doubleKey">0302-9743:1999:Manmatha R:scale:space:technique</idno>
<idno type="wicri:Area/Main/Merge">002051</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:99-0517474</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000795</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000B99</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000761</idno>
<idno type="wicri:doubleKey">0302-9743:1999:Manmatha R:scale:space:technique</idno>
<idno type="wicri:Area/Main/Merge">002142</idno>
<idno type="wicri:Area/Main/Curation">001F42</idno>
<idno type="wicri:Area/Main/Exploration">001F42</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Scale Space Technique for Word Segmentation in Handwritten Documents</title>
<author><name sortKey="Manmatha, R" sort="Manmatha, R" uniqKey="Manmatha R" first="R." last="Manmatha">R. Manmatha</name>
<affiliation wicri:level="4"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computer Science Department, University of Massachusetts, 01003, Amherst, MA</wicri:regionArea>
<placeName><region type="state">Massachusetts</region>
<settlement type="city">Amherst (Massachusetts)</settlement>
</placeName>
<orgName type="university">Université du Massachusetts</orgName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">États-Unis</country>
</affiliation>
</author>
<author><name sortKey="Srimal, Nitin" sort="Srimal, Nitin" uniqKey="Srimal N" first="Nitin" last="Srimal">Nitin Srimal</name>
<affiliation wicri:level="4"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computer Science Department, University of Massachusetts, 01003, Amherst, MA</wicri:regionArea>
<placeName><region type="state">Massachusetts</region>
<settlement type="city">Amherst (Massachusetts)</settlement>
</placeName>
<orgName type="university">Université du Massachusetts</orgName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">États-Unis</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>1999</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">C7DBB4B04AF1CCAA1EC1C3512BA05075940CB2AD</idno>
<idno type="DOI">10.1007/3-540-48236-9_3</idno>
<idno type="ChapterID">3</idno>
<idno type="ChapterID">Chap3</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Binary image</term>
<term>Document analysis</term>
<term>Grey level image</term>
<term>Hand writing</term>
<term>Manuscript character</term>
<term>Optical character recognition</term>
<term>Segmentation</term>
<term>Word</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Analyse documentaire</term>
<term>Caractère manuscrit</term>
<term>Ecriture</term>
<term>Image binaire</term>
<term>Image niveau gris</term>
<term>Mot</term>
<term>Reconnaissance optique caractère</term>
<term>Segmentation</term>
<term>Segmentation ligne</term>
<term>Segmentation mot</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Indexing large archives of historical manuscripts, like the papers of George Washington, is required to allow rapid perusal by scholars and researchers who wish to consult the original manuscripts. Presently, such large archives are indexed manually. Since optical character recognition (OCR) works poorly with handwriting, a scheme based on matching word images, called word spotting, has previously been suggested for indexing such documents. The important steps in this scheme are segmentation of a document page into words and creation of lists containing instances of the same word by word image matching. We have developed a novel methodology for segmenting handwritten document images by analyzing the extent of “blobs” in a scale space representation of the image. We believe this is the first application of scale space to this problem. The algorithm has been applied to around 30 grey-level images randomly picked from different sections of the George Washington corpus of 6,400 handwritten document images. An accuracy of 77 to 96 percent was observed, with an average accuracy of around 87 percent. The algorithm works well in the presence of noise, shine-through and other artifacts which may arise due to aging and degradation of the page over a couple of centuries or through the man-made processes of photocopying and scanning.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Massachusetts</li>
</region>
<settlement><li>Amherst (Massachusetts)</li>
</settlement>
<orgName><li>Université du Massachusetts</li>
</orgName>
</list>
<tree><country name="États-Unis"><region name="Massachusetts"><name sortKey="Manmatha, R" sort="Manmatha, R" uniqKey="Manmatha R" first="R." last="Manmatha">R. Manmatha</name>
</region>
<name sortKey="Manmatha, R" sort="Manmatha, R" uniqKey="Manmatha R" first="R." last="Manmatha">R. Manmatha</name>
<name sortKey="Srimal, Nitin" sort="Srimal, Nitin" uniqKey="Srimal N" first="Nitin" last="Srimal">Nitin Srimal</name>
<name sortKey="Srimal, Nitin" sort="Srimal, Nitin" uniqKey="Srimal N" first="Nitin" last="Srimal">Nitin Srimal</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001F42 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001F42 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:C7DBB4B04AF1CCAA1EC1C3512BA05075940CB2AD |texte= Scale Space Technique for Word Segmentation in Handwritten Documents }}
This area was generated with Dilib version V0.6.32.